{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "eaed3927-e315-46d3-8889-df3f3bbcbf6b",
   "metadata": {},
   "source": [
    "# Quantize your VLM with 🤗 Optimum Intel\n",
    "\n",
    "This notebook shows how to quantize a question answering model with [Optimum Intel](https://huggingface.co/docs/optimum-intel/en/openvino/optimization) and OpenVINO's [Neural Network Compression Framework](https://github.com/openvinotoolkit/nncf) (NNCF). \n",
    "\n",
    "Quantization is a technique to reduce the computational and memory costs of running inference by representing the weights and / or the activations with lower precision data types like 8-bit or 4-bit.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b70eeef0",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 1: Installation and Setup\n",
    "\n",
    "First, let's install the required dependencies."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e8ebc847-8181-4c8a-9236-12cb23904773",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "If you're opening this Notebook on colab, you will probably need to install 🤗 Optimum, . Uncomment the following cell and run it.\n",
    " First make sure everything is installed as expected by uncommenting this cell :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dffab375-a730-4015-8d17-360b76a0718d",
   "metadata": {},
   "outputs": [],
   "source": [
    "! pip install \"optimum-intel[openvino]\" datasets num2words torchvision transformers==4.52.*\n",
    "! pip install git+https://github.com/huggingface/optimum-benchmark.git"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7a179812",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 2: Preparation\n",
    "\n",
    "Now let's load the processor and prepare our input data. We'll use a sample image of a bee on a flower and ask the model what's on the flower.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "860ff939",
   "metadata": {},
   "source": [
    "![image](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f253327b-af28-41de-b010-8edbec3c2c4a",
   "metadata": {},
   "source": [
    "Load processor and prepare inputs :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ee1ff192-1b1e-4cec-ab83-119faf494c0c",
   "metadata": {},
   "outputs": [],
   "source": [
    "import transformers\n",
    "from transformers import AutoProcessor\n",
    "from transformers.image_utils import load_image\n",
    "transformers.logging.set_verbosity_error()\n",
    "\n",
    "model_id = \"echarlaix/SmolVLM2-256M-Video-Instruct-openvino\"\n",
    "processor = AutoProcessor.from_pretrained(model_id)\n",
    "prompt, img_url = \"What is on the flower?\", \"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg\"\n",
    "\n",
    "messages = [\n",
    "    {\n",
    "        \"role\": \"user\",\n",
    "        \"content\": [\n",
    "            {\"type\": \"image\"},\n",
    "            {\"type\": \"text\", \"text\": prompt}\n",
    "        ]\n",
    "    }\n",
    "]\n",
    "\n",
    "# Prepare inputs\n",
    "prompt = processor.apply_chat_template(messages, add_generation_prompt=True)\n",
    "inputs = processor(text=prompt, images=[load_image(img_url)], return_tensors=\"pt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0c9c5734",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 3: Load Original Model and Test\n",
    "\n",
    "Let's load the original FP32 model and test it with our prepared inputs to establish a baseline.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "0e776d77-b19c-4a82-a7ba-026143ab2035",
   "metadata": {},
   "outputs": [],
   "source": [
    "from optimum.intel import OVModelForVisualCausalLM\n",
    "\n",
    "\n",
    "model_ov = OVModelForVisualCausalLM.from_pretrained(model_id, load_in_8bit=False)\n",
    "fp32_model_path = \"smolvlm_ov\"\n",
    "model_ov.save_pretrained(fp32_model_path)\n",
    "\n",
    "# Generate outputs\n",
    "generated_ids = model_ov.generate(**inputs, max_new_tokens=500)\n",
    "generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)\n",
    "print(generated_texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1075a71e",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 4: Configure and Apply Quantization\n",
    "\n",
    "Now we'll configure the quantization settings and apply them to create a quantized version of our model. You can explore other quantization options [here](https://huggingface.co/docs/optimum/en/intel/openvino/optimization) and by playing with the different quantization configurations defined below.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "bfd08433",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "### Step 4a: Configure Quantization Settings\n",
    "\n",
    "To apply quantization on your model you need to create a quantization configuration specifying the methodology to use. By default 8bit weight-only quantization will be applied on the text and vision embeddings components, while the language model will be quantized depending on the specified quantization configuration `quantization_config`. A specific quantization configuration can be defined for each components as well, this can be done by creating an instance of `OVPipelineQuantizationConfig`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7ccb1914-d64e-4daf-b274-b8979e427a83",
   "metadata": {},
   "outputs": [],
   "source": [
    "from optimum.intel import OVQuantizationConfig, OVWeightQuantizationConfig, OVPipelineQuantizationConfig\n",
    "\n",
    "dataset, num_samples = \"contextual\", 50\n",
    "\n",
    "# weight-only 8bit\n",
    "woq_8bit = OVWeightQuantizationConfig(bits=8)\n",
    "\n",
    "# weight-only 4bit\n",
    "woq_4bit = OVWeightQuantizationConfig(bits=4, group_size=16)\n",
    "\n",
    "# static quantization\n",
    "static_8bit = OVQuantizationConfig(bits=8, dataset=dataset, num_samples=num_samples)\n",
    "\n",
    "# pipeline quantization: applying different quantization on each components\n",
    "ppl_q = OVPipelineQuantizationConfig(\n",
    "    quantization_configs={\n",
    "        \"lm_model\": OVQuantizationConfig(bits=8),\n",
    "        \"text_embeddings_model\": OVWeightQuantizationConfig(bits=8),\n",
    "        \"vision_embeddings_model\": OVWeightQuantizationConfig(bits=8),\n",
    "    },\n",
    "    dataset=dataset,\n",
    "    num_samples=num_samples,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e159efa8",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "### Step 4b: Apply Quantization\n",
    "\n",
    "You can now apply quantization on your model, here we apply wieght-only quantization on our model defined in `woq_8bit`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f56c799d-2e0c-49e3-9698-ae0a92a80150",
   "metadata": {},
   "outputs": [],
   "source": [
    "q_model = OVModelForVisualCausalLM.from_pretrained(model_id, quantization_config=woq_8bit)\n",
    "int8_model_path = \"smolvlm_int8\"\n",
    "q_model.save_pretrained(int8_model_path)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "0558b3b8",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Step 5: Compare Results\n",
    "\n",
    "Let's test the quantized model and compare it with the original model in terms of both output quality and model size.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a52faa10",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "### Step 5a: Test Quantized Model Output\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cfc85d61-010b-4ea4-83a9-21391bc43cee",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Generate outputs with quantized model\n",
    "generated_ids = q_model.generate(**inputs, max_new_tokens=500)\n",
    "generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)\n",
    "print(generated_texts[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d7778bf",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "### Step 5b: Compare Model Sizes\n",
    "\n",
    "Now let's compare the file sizes of the original FP32 model and the quantized INT8 model:\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1eeaa81f-7fc5-49ba-80b8-2d95a1310a0c",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "from pathlib import Path\n",
    "\n",
    "def get_model_size(model_folder):\n",
    "    model_size = 0\n",
    "    for file in Path(model_folder).iterdir():\n",
    "        if file.suffix==\".xml\":\n",
    "            model_size += file.stat().st_size + file.with_suffix(\".bin\").stat().st_size\n",
    "    model_size /= 1000 * 1000\n",
    "    return model_size"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3c862277",
   "metadata": {},
   "outputs": [],
   "source": [
    "fp32_model_size = get_model_size(fp32_model_path)\n",
    "int8_model_size = get_model_size(int8_model_path)\n",
    "print(f\"FP32 model size: {fp32_model_size:.2f} MB\")\n",
    "print(f\"INT8 model size: {int8_model_size:.2f} MB\")\n",
    "print(f\"INT8 size decrease: {fp32_model_size / int8_model_size:.2f}x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c7ef9297",
   "metadata": {},
   "source": [
    "### Step 5c: Compare performance on different Intel Hardware platforms"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "23406fcb",
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib\n",
    "matplotlib.use(\"Agg\")\n",
    "import matplotlib.pyplot as plt\n",
    "from huggingface_hub import create_repo, upload_file\n",
    "from optimum_benchmark import (\n",
    "    Benchmark,\n",
    "    BenchmarkConfig,\n",
    "    BenchmarkReport,\n",
    "    InferenceConfig,\n",
    "    OpenVINOConfig,\n",
    "    ProcessConfig,\n",
    "    PyTorchConfig,\n",
    ")\n",
    "from optimum_benchmark.logging_utils import setup_logging\n",
    "from openvino.runtime import Core\n",
    "\n",
    "setup_logging(level=\"INFO\", prefix=\"MAIN-PROCESS\")\n",
    "\n",
    "launcher_config = ProcessConfig()\n",
    "scenario_config = InferenceConfig(\n",
    "    memory=True,\n",
    "    latency=True,\n",
    "    generate_kwargs={\"max_new_tokens\": 16, \"min_new_tokens\": 16},\n",
    "    input_shapes={\"batch_size\": 1, \"sequence_length\": 16, \"num_images\": 1},\n",
    ")\n",
    "\n",
    "configs = {\n",
    "    \"pytorch\": PyTorchConfig(device=\"cpu\", model=model_id, no_weights=True),\n",
    "    \"openvino\": OpenVINOConfig(device=\"cpu\", model=model_id, no_weights=True),\n",
    "    \"openvino-8bit-woq\": OpenVINOConfig(\n",
    "        device=\"cpu\",\n",
    "        model=model_id,\n",
    "        no_weights=True,\n",
    "        quantization_config={\"bits\": 8, \"num_samples\": 1, \"weight_only\": True},\n",
    "    ),\n",
    "}\n",
    "\n",
    "for config_name, backend_config in configs.items():\n",
    "    benchmark_config = BenchmarkConfig(\n",
    "        name=f\"{config_name}\",\n",
    "        launcher=launcher_config,\n",
    "        scenario=scenario_config,\n",
    "        backend=backend_config,\n",
    "    )\n",
    "    benchmark_report = Benchmark.launch(benchmark_config)\n",
    "    benchmark_report.save_json(f\"{config_name}_report.json\")\n",
    "    benchmark_config.save_json(f\"{config_name}_config.json\")\n",
    "\n",
    "reports = {}\n",
    "for config_name in configs.keys():\n",
    "    reports[config_name] = BenchmarkReport.from_json(f\"{config_name}_report.json\")\n",
    "\n",
    "# Plotting results\n",
    "_, ax = plt.subplots()\n",
    "ax.boxplot(\n",
    "    [reports[config_name].prefill.latency.values for config_name in reports.keys()],\n",
    "    tick_labels=reports.keys(),\n",
    "    showfliers=False,\n",
    ")\n",
    "plt.xticks(rotation=10)\n",
    "ax.set_ylabel(\"Latency (s)\")\n",
    "ax.set_xlabel(\"Configurations\")\n",
    "ax.set_title(\"Prefill Latencies\")\n",
    "plt.savefig(\"prefill_latencies_boxplot.png\")\n",
    "\n",
    "_, ax = plt.subplots()\n",
    "ax.bar(\n",
    "    list(reports.keys()),\n",
    "    [reports[config_name].decode.throughput.value for config_name in reports.keys()],\n",
    "    color=[\"C0\", \"C1\", \"C2\", \"C3\", \"C4\", \"C5\"],\n",
    ")\n",
    "plt.xticks(rotation=10)\n",
    "ax.set_xlabel(\"Configurations\")\n",
    "ax.set_title(\"Decoding Throughput\")\n",
    "ax.set_ylabel(\"Throughput (tokens/s)\")\n",
    "plt.savefig(\"decode_throughput_barplot.png\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "1f1720e5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print results\n",
    "import json\n",
    "import pandas as pd\n",
    "\n",
    "# List of config names\n",
    "config_names = list(configs.keys())\n",
    "\n",
    "# Stages we want to include in the table\n",
    "stages = [\"load_model\", \"first_generate\", \"prefill\", \"generate\", \"decode\"]\n",
    "\n",
    "table_rows = []\n",
    "\n",
    "for config_name in config_names:\n",
    "    report_file = f\"{config_name}_report.json\"\n",
    "    with open(report_file, \"r\") as f:\n",
    "        report_data = json.load(f)\n",
    "    \n",
    "    row = {\"Configuration\": config_name}\n",
    "    \n",
    "    for stage in stages:\n",
    "        stage_data = report_data.get(stage, {})\n",
    "\n",
    "        # Latency (mean)\n",
    "        latency_mean = stage_data.get(\"latency\", {}).get(\"mean\")\n",
    "        row[f\"{stage} Latency (s)\"] = round(latency_mean, 3) if latency_mean is not None else \"N/A\"\n",
    "        \n",
    "        # Throughput (value + unit)\n",
    "        throughput_data = stage_data.get(\"throughput\")\n",
    "        if throughput_data:\n",
    "            throughput_value = throughput_data.get(\"value\")\n",
    "            throughput_unit = throughput_data.get(\"unit\", \"\")\n",
    "            row[f\"{stage} Throughput\"] = f\"{throughput_value:.3f} {throughput_unit}\" if throughput_value else \"N/A\"\n",
    "        else:\n",
    "            row[f\"{stage} Throughput\"] = \"N/A\"\n",
    "        \n",
    "        # Max RAM\n",
    "        memory_max = stage_data.get(\"memory\", {}).get(\"max_ram\")\n",
    "        row[f\"{stage} Memory (MB)\"] = round(memory_max, 2) if memory_max is not None else \"N/A\"\n",
    "    \n",
    "    table_rows.append(row)\n",
    "\n",
    "# Build the DataFrame\n",
    "df = pd.DataFrame(table_rows)\n",
    "\n",
    "# Optional: reorder columns for readability\n",
    "columns_order = [\"Configuration\"]\n",
    "for stage in stages:\n",
    "    columns_order += [\n",
    "        f\"{stage} Latency (s)\",\n",
    "        f\"{stage} Throughput\",\n",
    "        f\"{stage} Memory (MB)\"\n",
    "    ]\n",
    "df = df[columns_order]\n",
    "\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "43531db0",
   "metadata": {
    "vscode": {
     "languageId": "raw"
    }
   },
   "source": [
    "## Conclusion\n",
    "\n",
    "Great! We've successfully quantized our VLM model using Optimum Intel. The results show:\n",
    "\n",
    "1. **Quality**: The quantized model produces the same output as the original model\n",
    "2. **Size**: We achieved approximately 4x reduction in model size (from ~1GB to ~260MB)\n",
    "3. **Performance**: The INT8 model has been reduced on size maintaining the accuracy\n",
    "\n",
    "This demonstrates how quantization can significantly reduce model size preserving the model's accuracy for visual language tasks.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
